2 Tests of Autocorrelation
2.1 Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF)
This section generates Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of the time series data. The acf2 function (from the astsa package) is used here.
Recall what stationarity means for your data:
- No Systematic Trends: Your data hovers around a roughly constant average value over time, with no clear upward or downward drifts.
- Consistent Variance: Your data doesn’t get systematically more ‘spread out’ as time goes on.
- Why It Matters: Many time series methods, including ARIMA modeling, perform best when your data is stationary (or you’ve transformed it to achieve stationarity).
Let’s take a look at the ACF and PACF plots of this series in order to identify a suitable model for it.
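A minimal sketch of how these plots can be produced; the data frame name stationary_df and its value column are carried over from the regression output later in this section, and max.lag = 21 simply mirrors the 21 lags reported there:

library(astsa)

# ACF and PACF of the series in a single figure; acf2 also draws
# the blue dashed significance bounds referenced below
acf2(stationary_df$value, max.lag = 21)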
Interpreting the ACF Plot (acf_stationary)
- Significant Spikes: Are there tall bars exceeding the blue dashed lines early on? This implies correlation at specific lags (e.g., today is similar to yesterday).
- Decaying Pattern: Does the spike height drop rapidly, with most bars within the dashed lines after a few lags? This is common for stationary series, as further ‘echoes’ in time get fainter.
- No Obvious Seasonality: If your data had monthly cycles, the ACF would reflect it with repeated spikes every 12 lags. You likely won’t see this here.
What about the PACF?
- The PACF would likely not show many major spikes beyond the first couple. That tells us, once you account for very short-term correlation, your ‘echoes’ mostly disappear!
2.2 Statistical Tests for Autocorrelation
Other common methods are the Ljung-Box (Box-Ljung) test and the Durbin-Watson test. Both evaluate the null hypothesis that there is no autocorrelation in the data. If the p-value of the test is less than a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is autocorrelation in the data.
- Durbin-Watson Test (dwtest): A formal statistical test to detect whether there is significant autocorrelation in the residuals.
- Ljung-Box Test (Box.test): Checks for autocorrelation in the residuals of a time series model. Autocorrelation here means that the residuals (errors) of the model are correlated with each other at different lags.
Null Hypothesis:
- H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).
Here are some steps you can take to check for autocorrelation in your time series data:
- Fit a linear regression model to your data.
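A minimal sketch of this step, consistent with the model call shown in the output below (stationary_df with columns value and t is taken from that printed call):

# Fit a simple linear time-trend model and summarize it
fit <- stats::lm(value ~ t, data = stationary_df)
summary(fit)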
Call:
stats::lm(formula = value ~ t, data = stationary_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5272  -2.8354  -0.2433   2.9391  11.1638 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 50.581625    0.823364  61.433 <0.0000000000000002 ***
t           -0.008337    0.011810  -0.706               0.482    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.482 on 118 degrees of freedom
Multiple R-squared:  0.004206,  Adjusted R-squared:  -0.004233
F-statistic: 0.4984 on 1 and 118 DF,  p-value: 0.4816
- Calculate the residuals from the model.
- Plot the autocorrelation function (ACF) of the residuals.
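A sketch of this step using acf2 on the model residuals; max.lag = 21 matches the 21 lags printed below:

# ACF/PACF plot of the regression residuals; acf2 also returns the
# values as a matrix, which is what appears below
res <- resid(fit)
acf2(res, max.lag = 21)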
     [,1]  [,2] [,3]  [,4] [,5] [,6] [,7]  [,8]  [,9] [,10] [,11] [,12] [,13]
ACF  0.01 -0.08 0.12 -0.08 0.04 0.04 0.02 -0.02 -0.08 -0.08  0.06 -0.12 -0.11
PACF 0.01 -0.08 0.12 -0.09 0.06 0.01 0.05 -0.04 -0.07 -0.10  0.06 -0.14 -0.09
     [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
ACF   0.13 -0.08 -0.09  0.12 -0.08 -0.02 -0.06  0.01
PACF  0.09 -0.05 -0.07  0.09 -0.07  0.01 -0.11  0.04
- Perform a Durbin-Watson test on the residuals.
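A sketch of the call, matching the data line in the output below (requires the lmtest package):

library(lmtest)

# Durbin-Watson test; a DW statistic near 2 suggests no
# first-order autocorrelation in the residuals
dwtest(stationary_df$value ~ stationary_df$t)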
Durbin-Watson test
data: stationary_df$value ~ stationary_df$t
DW = 1.9701, p-value = 0.3979
alternative hypothesis: true autocorrelation is greater than 0
- Perform a Box-Ljung test on the residuals.
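A sketch of this final step; the lag choice of 10 is an assumption:

# Ljung-Box test on the regression residuals
Box.test(resid(fit), lag = 10, type = "Ljung-Box")

As with the Durbin-Watson result above (DW = 1.9701, p = 0.3979), a Ljung-Box p-value above 0.05 would mean we fail to reject the null hypothesis of no autocorrelation in the residuals.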
2.3 Important Note Regarding Autocorrelation
In time series analysis, especially when dealing with ARIMA models, the distinction between the autocorrelation in the observed time series and the autocorrelation in the residuals is crucial. A significant part of model validation involves checking for autocorrelation in residuals to ensure the model is appropriately fitted to the data.
2.3.1 Observed Time Series
The observed time series often has autocorrelation, and that’s expected; the goal of time series modeling is to capture it. Many real time series inherently display autocorrelation, meaning the current value in the series is correlated with its previous values. For instance, today’s temperature is often similar to yesterday’s, and sales in one month might be influenced by sales in the previous months. This is precisely why we use models like ARIMA: they are designed to capture and explain such autocorrelation in the data.
Accounting for this autocorrelation matters in two ways:
- Accurate Parameter Estimates: Autocorrelation, if present in your data, violates the assumptions of standard models like OLS. To get reliable estimates of quantities like the intervention effect in an interrupted time series (ITS) analysis, you need to account for this dependence between observations over time.
- Valid Inference: Statistical tests, p-values, and confidence intervals rely on certain assumptions about the errors. Autocorrelation invalidates these, so without modeling it, your conclusions might be incorrect.
2.3.2 Residuals
After fitting a time series model like ARIMA, the residuals (the differences between the observed values and the model’s predicted values) should ideally show no autocorrelation. If they do, the model has not fully captured all the information in the time series, particularly the patterns or structures related to time. Essentially, there is still some predictable aspect left in the residuals that the model should have accounted for, and there is room for improvement.
Ideal residuals should resemble white noise, meaning they should be random and not exhibit any discernible patterns or trends. This indicates a well-fitting model.
This matters for two reasons:
- Model Diagnostics: Non-autocorrelated residuals are one indicator of a well-specified model. If autocorrelation remains in the residuals, there is more structure in the data that your model isn’t capturing.
- Improved Forecasting: For many time series applications, the goal is forecasting future values. Models that leave autocorrelation in the residuals may produce less accurate forecasts, since they haven’t fully captured the patterns.
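A sketch of this diagnostic in practice; the AR(1) order below is an illustrative assumption, not a recommendation for any particular series:

# Fit a candidate ARIMA model, then test its residuals for leftover
# autocorrelation; fitdf = 1 adjusts the test's degrees of freedom
# for the one estimated AR coefficient
arima_fit <- arima(stationary_df$value, order = c(1, 0, 0))
Box.test(residuals(arima_fit), lag = 10, type = "Ljung-Box", fitdf = 1)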
2.3.3 Illustrative Example
Imagine fitting a simple linear regression to estimate the effect of an intervention in an ITS design. If your time series exhibits positive autocorrelation (positive values tend to follow other positive values), the OLS standard errors will be too small, making the intervention effect appear more statistically significant than it actually is.
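A small simulation sketch of this effect; every name and parameter value here is an illustrative assumption:

set.seed(123)
n <- 120
# Errors with positive AR(1) autocorrelation: positive values
# tend to follow other positive values
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))
y <- 50 + e
t <- seq_len(n)

fit_naive <- lm(y ~ t)
summary(fit_naive)$coefficients  # OLS standard errors ignore the dependence
lmtest::dwtest(fit_naive)        # DW well below 2 flags the autocorrelation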
Caveats and Nuances:
- Sometimes autocorrelation is merely a symptom of non-stationarity in your time series. In that case, modeling the autocorrelation alone won’t solve the problem; you might need differencing or other transformations.
- Even with appropriate modeling, slight residual autocorrelation sometimes persists. Statistical tests might flag it, but there is a trade-off between model complexity and parsimony.
2.3.4 In Summary
The ultimate goal is a model that thoroughly explains the patterns in your time series. To get there, we actively model the dependency between values across time, which yields a valid, reliable model whose residuals behave in line with statistical assumptions.